Abstract

Deep neural networks have recently achieved state-of-the-art results in many
machine learning problems, e.g., speech recognition or object recognition.
Previous work on rectified linear units (ReLU) provides empirical and
theoretical evidence of improved neural network performance in comparison to
the typically used sigmoid activation function. In this paper, we investigate
a new manner of improving neural networks by introducing a bunch of copies of
the same neuron modeled by the generalized Kumaraswamy distribution. As a
result, we propose a novel non-linear activation function, which we refer to
as the Kumaraswamy unit, that is closely related to ReLU. In an experimental
study on the MNIST image corpus we evaluate the Kumaraswamy unit applied to a
single-layer (shallow) neural network and report a significant drop in test
classification error and test cross-entropy in comparison to the sigmoid unit,
ReLU and Noisy ReLU.

1. Introduction 

Deep neural networks are quickly becoming a crucial element of high
performance systems in many domains (Bengio et al., 2013), e.g., speech
recognition, object recognition, natural language processing, multi-task and
domain adaptation. Typical neural networks are based on sigmoid hidden units
(Bengio, 2009); however, they can suffer from the vanishing gradient problem
(Bengio et al., 1994). The issue may arise when lower layers of a neural
network have gradients close to 0 because higher layers are mostly saturated
at 0 or 1. Vanishing gradients may drastically slow down the optimization
procedure and may eventually lead to a poor local minimum.

In order to overcome issues associated with the sigmoid non-linearity, it is
advocated to utilize other types of hidden units. Recently, deep neural
networks with rectified linear units (ReLU) have seen success in different
applications, e.g., signal processing (Zeiler et al., 2013), sentiment
analysis (Glorot et al., 2011), object recognition (Jarrett et al., 2009), and
image analysis (Nair & Hinton, 2010). It has been shown that piece-wise linear
units, such as ReLU, can compute highly complex and structured functions
(Montufar et al., 2014). The practical success and theoretical results on ReLU
have indicated a new direction for research. In (Maas et al., 2013) a leaky
version of ReLU was proposed. The empirical evaluation on a speech recognition
task has shown a slight improvement in comparison to the sigmoid unit and
ReLU. Further investigations with a parametrized leaky ReLU (called Parametric
ReLU) in (He et al., 2015) confirmed the presumption that the simple ReLU may
be too restrictive to learn a fully successful representation. Recently,
Agostinelli et al. (2015) went even further and proposed Adaptive Piecewise
Linear Units (APLU), i.e., ReLU with a piecewise linear part for negative
values. In the experiments it was shown that APLU can lead to a significant
performance increase.

In this work, we propose to improve neural networks by modeling neurons in a
new manner. Our idea is to replicate a neuron with the same weights and biases
in order to increase the robustness of learning a pattern. Moreover, we take
into account the complex structure of a single neuron, which is represented by
an additional parameter. A suitable way of modeling such a bunch of neurons is
to apply a generalized Kumaraswamy distribution (Kum-G) (Cordeiro & de Castro,
2011; Nadarajah et al., 2012). The Kumaraswamy distribution (Kum) can be seen
as an alternative to the Beta distribution (Jones, 2009), and Kum-G determines
a new class of distributions for a given base probability measure. In our
case, assuming that a single neuron in the bunch of neurons is modeled by a
sigmoid function, we obtain a novel non-linear activation function which we
refer to as the Kumaraswamy unit. For properly chosen parameters of Kum-G, the
Kumaraswamy unit behaves similarly to ReLU.

The contribution of the paper is the following: i) we introduce an original
non-linear activation function (the Kumaraswamy unit) which follows from
modeling a bunch of copies of the same neuron using the generalized
Kumaraswamy distribution, ii) we show a close relationship between the
Kumaraswamy unit and ReLU, iii) we provide an empirical evaluation of a
single-layer neural network with the proposed hidden unit applied to the MNIST
dataset.


2. Modeling a bunch of neurons: Kumaraswamy unit

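Preliminaries Let us focus on a conventional feed-forward neural network with
an input (visibles) $\mathbf{v} \in [0, 1]^{D \times 1}$ and an output
$\mathbf{y} \in \{0, 1\}^{K \times 1}$ such that $\sum_k y_k = 1$. In general,
the network consists of $L$ hidden layers; however, we restrict our
considerations to one hidden layer for clarity. Therefore, the parameters of
the network are $\theta = \{\mathbf{c}, \mathbf{d}, \mathbf{W}, \mathbf{U}\}$,
where $\mathbf{c} \in \mathbb{R}^{M \times 1}$ and
$\mathbf{d} \in \mathbb{R}^{K \times 1}$ denote hidden and output biases,
respectively, $\mathbf{W} \in \mathbb{R}^{M \times D}$ are input-to-hidden
weights, and $\mathbf{U} \in \mathbb{R}^{K \times M}$ are hidden-to-output
weights. The output of the network is modeled by the softmax unit:

$$p(y_k = 1 \mid \mathbf{v}, \theta) = \frac{\exp(\mathbf{U}_{k\cdot} f(\mathbf{v}; \mathbf{c}, \mathbf{W}) + d_k)}{\sum_i \exp(\mathbf{U}_{i\cdot} f(\mathbf{v}; \mathbf{c}, \mathbf{W}) + d_i)}, \qquad (1)$$

where $\mathbf{U}_{i\cdot}$ denotes the $i$-th row of the matrix $\mathbf{U}$,
and $f(\mathbf{v}; \mathbf{c}, \mathbf{W})$ is the $M$-dimensional output of
the hidden layer.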
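
As a minimal sketch of Equation (1), the forward pass of the one-hidden-layer
network can be written in plain Python. The function names (`hidden_layer`,
`softmax_output`), the list-of-lists weight layout and the toy numbers are
illustrative assumptions, not taken from the paper. The sketch also assumes
the usual affine pre-activation, i.e., that the hidden layer computes
$f(\mathbf{W}\mathbf{v} + \mathbf{c})$ element-wise.

```python
import math

def hidden_layer(v, c, W, f):
    """Return the M-dimensional hidden output f(W v + c), applied element-wise.

    Assumption: the pre-activation is the affine map W v + c.
    """
    return [f(sum(w_md * v_d for w_md, v_d in zip(row, v)) + c_m)
            for row, c_m in zip(W, c)]

def softmax_output(h, d, U):
    """Return p(y_k = 1 | v, theta) for every class k, as in Equation (1)."""
    scores = [sum(u_km * h_m for u_km, h_m in zip(row, h)) + d_k
              for row, d_k in zip(U, d)]
    top = max(scores)  # subtracting the maximum leaves the softmax unchanged
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example with D = 2 inputs, M = 2 hidden units, K = 3 classes (made-up numbers).
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
h = hidden_layer([0.2, 0.9], c=[0.0, 0.1], W=[[0.5, -0.3], [0.1, 0.8]], f=sigmoid)
p = softmax_output(h, d=[0.0, 0.0, 0.0], U=[[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]])
```

Subtracting the largest score before exponentiating is a common numerical
safeguard and does not change the probabilities in Equation (1).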

The activity of hidden units is modeled by some element-wise non-linear
function $f(\cdot)$. A typical activation function is the sigmoid function:

$$\sigma(x) = \frac{1}{1 + \exp(-x)}. \qquad (2)$$


Recently, several alternatives to the sigmoid function have been used in
numerous applications, such as the rectified linear unit (ReLU) (Jarrett et
al., 2009):

$$r(x) = \max\{0, x\}, \qquad (3)$$

or the noisy rectified linear unit (Noisy ReLU) (Nair & Hinton, 2010):

$$n(x) = \max\{0, x + \mathcal{N}(0, v)\}, \qquad (4)$$

where $\mathcal{N}(0, v)$ denotes a noise term drawn from a normal
distribution with zero mean and variance $v$.
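
The three activation functions in Equations (2)-(4) can be sketched in a few
lines of Python. The function names and the optional `rng` argument are
assumptions made for this illustration only.

```python
import math
import random

def sigmoid(x):
    """Equation (2)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Equation (3)."""
    return max(0.0, x)

def noisy_relu(x, variance, rng=random):
    """Equation (4): add zero-mean Gaussian noise, then rectify.

    random.gauss expects a standard deviation, hence the square root.
    """
    noise = rng.gauss(0.0, math.sqrt(variance))
    return max(0.0, x + noise)
```

For example, `noisy_relu(0.3, 1.0, random.Random(0))` draws one noisy
activation with a reproducible seed; the result is never negative.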


Bunch of neurons Let us assume that the activation of a neuron is modeled by a
sigmoid unit. Moreover, let us presume that each neuron consists of $a$
independent elements and there are $b$ independent copies of the same neuron.
Therefore, instead of a single neuron we consider a bunch of neurons that try
to reflect one pattern. A similar idea with replicas of a neuron was
introduced in (Teh & Hinton, 2001), where the replication of sigmoid hidden
units with the same weights and biases led to binomial units. Further, it
turned out that binomial units with fixed offsets to the biases resulted in
softplus units and their fast approximation, i.e., Noisy ReLU (Nair & Hinton,
2010). However, in our approach we copy the sigmoid hidden unit $b$ times and
additionally we introduce a second parameter, $a$, which corresponds to
modeling the complexity of the neuron itself. Increasing the value of $b$
results in higher robustness of the bunch of neurons because it is less
probable that the input signal will not activate any hidden unit in the bunch.
On the other hand, increasing the value of $a$ leads to a higher failure
probability of the neuron activation because the failure of a single element
suffices to deactivate the whole neuron.

Figure 1. A comparison of chosen non-linear activation functions used in a
hidden layer of a neural network. The green curve corresponds to the typically
used sigmoid function. The red curve represents the ReLU non-linearity. The
black curve shows the expected value of Noisy ReLU with $v = 1$. The magenta
and blue curves depict the Kumaraswamy units with different values of the
shape parameters, (5, 6) and (8, 30), respectively.

It turns out that a suitable manner of modeling a bunch of neurons as
described above is a generalized Kumaraswamy distribution (Cordeiro & de
Castro, 2011; Nadarajah et al., 2012). The generalized Kumaraswamy
distribution (Kum-G) for a given base distribution $G(x)$ with a probability
density function $g(x)$ is defined as follows:

$$K_G(x \mid a, b) = 1 - (1 - G(x)^a)^b, \qquad (5)$$

where $a > 0$ and $b > 0$ are shape parameters. The probability density
function of $K_G(x \mid a, b)$ has a simple form:

$$k_G(x \mid a, b) = a b \, g(x) \, G(x)^{a-1} (1 - G(x)^a)^{b-1}. \qquad (6)$$

For integer-valued shape parameters $a$ and $b$, Kum-G has a nice
interpretation as a system which consists of $b$ independent components, where
each component is made up of $a$ independent subcomponents. Kum-G is well
suited to modeling a chosen property of a complex system, such as the lifetime
of an entire system (Nadarajah et al., 2012). We can clearly apply Kum-G to
represent the bunch of neurons.
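
The system interpretation can be made explicit for integer $a$ and $b$. The
short derivation below is a sketch under the assumptions that every
subcomponent works independently with probability $G(x)$, that a component
works only if all of its $a$ subcomponents work, and that the system works if
at least one of its $b$ components works:

```latex
\begin{align*}
P(\text{one component works}) &= G(x)^{a}, \\
P(\text{all } b \text{ components fail}) &= \bigl(1 - G(x)^{a}\bigr)^{b}, \\
P(\text{system works}) &= 1 - \bigl(1 - G(x)^{a}\bigr)^{b} = K_G(x \mid a, b).
\end{align*}
```

Under these assumptions, a larger $a$ lowers the probability that a single
component works, while a larger $b$ raises the probability that at least one
component works.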

Kumaraswamy unit Assuming that a single neuron activates according to the
sigmoid function, we can take advantage of Kum-G to model the bunch of
neurons, which yields a new kind of non-linear activation function:

$$K_\sigma(x \mid a, b) = 1 - (1 - \sigma(x)^a)^b. \qquad (7)$$

We refer to the resulting unit as the Kumaraswamy unit (Kumaraswamy($a$,
$b$)). Obviously, for $a = b = 1$ one recovers the sigmoid activation
function. In the context of gradient-based learning algorithms it is important
to compute the derivative of a hidden unit. Since the derivative of the
sigmoid function can be easily calculated, the derivative of the Kumaraswamy
unit can be obtained immediately (see Equation 6).
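
A minimal Python sketch of Equation (7) and of its derivative is given below.
The derivative follows Equation (6) with $G = \sigma$ and
$g(x) = \sigma(x)(1 - \sigma(x))$. The function names and the
finite-difference check are illustrative assumptions and are not part of the
original experiments.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kumaraswamy_unit(x, a, b):
    """Equation (7): 1 - (1 - sigma(x)^a)^b."""
    return 1.0 - (1.0 - sigmoid(x) ** a) ** b

def kumaraswamy_grad(x, a, b):
    """Equation (6) with G = sigma and g = sigma * (1 - sigma)."""
    s = sigmoid(x)
    return a * b * s * (1.0 - s) * s ** (a - 1) * (1.0 - s ** a) ** (b - 1)

# Central finite difference as a sanity check of the analytic derivative.
x, a, b, eps = 0.3, 5, 6, 1e-6
numeric = (kumaraswamy_unit(x + eps, a, b) - kumaraswamy_unit(x - eps, a, b)) / (2 * eps)
analytic = kumaraswamy_grad(x, a, b)
```

With `a = b = 1` the function `kumaraswamy_unit` reduces to `sigmoid`, which
matches the remark above.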

An intriguing property of the Kumaraswamy unit is that for properly chosen
values of $a$ and $b$ it can behave like ReLU (see Figure 1 for a comparison
of the sigmoid unit, ReLU, Noisy ReLU and the Kumaraswamy unit). We consider
only two pairs of values of the shape parameters, namely,
$(a, b) \in \{(5, 6), (8, 30)\}$. The Kumaraswamy unit with $a = 5$ and
$b = 6$ is the closest$^1$ approximation of ReLU for the value 0.5, while the
second pair of values gives the closest approximation of ReLU at the points
0.25 and 0.75. As can be seen in Figure 1, the Kumaraswamy unit can behave
similarly to ReLU, but it returns values between 0 and 1 like the sigmoid
function. We argue that such behavior may be crucial in training a neural
network and is more biologically plausible.

Training Learning the parameters of the network $\theta$ for given data
$\mathcal{D} = \{(\mathbf{v}_n, \mathbf{t}_n)\}_{n=1}^{N}$ is performed by
minimizing the cross-entropy loss. In the case of the neural network with
output given by the softmax unit, the cross-entropy loss is equivalent to the
negative conditional log-likelihood function:

$$\ell(\theta) = -\sum_n \sum_k t_{kn} \log p(y_k = 1 \mid \mathbf{v}_n, \theta). \qquad (8)$$
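
Equation (8) can be illustrated with a short Python sketch. The function name
`cross_entropy`, the nested-list data layout and the toy probabilities are
assumptions made for this example.

```python
import math

def cross_entropy(probs, targets):
    """Equation (8): negative log-likelihood summed over examples n and classes k.

    probs[n][k]   : p(y_k = 1 | v_n, theta) from the softmax output
    targets[n][k] : one-hot target t_kn
    """
    return -sum(t * math.log(p)
                for p_n, t_n in zip(probs, targets)
                for p, t in zip(p_n, t_n)
                if t > 0)  # zero targets contribute nothing and are skipped

# Two made-up examples with K = 3 classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [[1, 0, 0], [0, 1, 0]]
loss = cross_entropy(probs, targets)
```

Because the targets are one-hot, only the log-probability of the correct class
of each example enters the sum.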

3. Experiments 

Goal In the experiment we aim at answering the following question:

• Is the Kumaraswamy unit preferable to the sigmoid unit, ReLU or Noisy ReLU
in training a single-layer neural network using stochastic gradient descent?

1 In the Euclidean sense.

We want to point out that we try to verify only the impact of the proposed
non-linearity on learning the neural network. We believe that a positive
answer to the stated question will give us a good starting point for further
experiments with multi-layered neural networks, unsupervised pre-training and
more sophisticated training techniques.

Data In the experiment we deal with the well-known MNIST dataset$^2$ for
hand-written digit classification. We split the dataset into 50,000 training
images, 10,000 images for validation, and 10,000 test examples. Each image
consists of 28 x 28 pixels ($D = 784$) and is labeled as one of ten possible
digits ($K = 10$).

Learning methodology In the experiment we focus on a single-layer neural
network with 500 hidden units ($M = 500$). We train the model using stochastic
gradient descent with momentum and a mini-batch size of 100 examples. The
initial value of the momentum term is set according to the model selection
with possible values in $\{0, 0.5\}$, and after 50 epochs it is set to 0.9.
The learning rate is determined in the model selection procedure with possible
values in $\{0.001, 0.01, 0.1\}$. During the learning process, we apply a
learning step policy in which the learning step is divided by 2 after every 10
epochs. Additionally, we add a weight decay to the learning objective and we
explore the following values of the regularization coefficient:
$\{0, 10^{-5}, 10^{-4}\}$. The maximum number of epochs is set to 100. The
number of iterations over the training set is determined using early stopping
according to the validation classification error, with a look ahead of 10
epochs. For all activation units the same initialization of the parameters is
applied.
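
The schedule described above can be sketched as three small Python helpers.
The function names, the convention that epochs are counted from 0, and the
exact stopping rule (stop once the validation error has not improved for
`look_ahead` consecutive epochs) are assumptions of this sketch; the paper
does not spell out these details.

```python
def learning_rate(initial_rate, epoch):
    """Halve the learning rate after every 10 epochs (epoch counted from 0)."""
    return initial_rate / (2 ** (epoch // 10))

def momentum(initial_momentum, epoch):
    """Use the selected initial momentum for the first 50 epochs, then 0.9."""
    return initial_momentum if epoch < 50 else 0.9

def early_stopping_epoch(validation_errors, look_ahead=10):
    """Return the index of the best epoch, stopping once no improvement
    has been seen for `look_ahead` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= look_ahead:
            break
    return best_epoch
```

For instance, with an initial learning rate of 0.1 the helper returns 0.1 for
epochs 0 to 9 and 0.05 for epochs 10 to 19.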

Evaluation methodology In order to verify whether the Kumaraswamy unit is
preferable to the sigmoid unit, ReLU or Noisy ReLU, we use two evaluation
metrics, namely, the test classification error (Error) and the test
cross-entropy loss (Cross-Entropy).$^3$ Additionally, we measure the mean
activity of hidden units.

Results The results for the considered non-linear activation functions are
gathered in Table 1. A single run of the training procedure for the considered
units is presented in Figure 2. Moreover, in order to get better insight into
the representation trained with the considered non-linearities, the
input-to-hidden weights are depicted in Figure 3 and the mean activities of
hidden units are shown in Figure 4.

2 http://yann.lecun.com/exdb/mnist/

3 In the experiment we report the test cross-entropy loss divided by the
number of test examples.


Table 1. Test classification error and test cross-entropy loss for a
single-layer neural network with different hidden units on MNIST, averaged
over 5 experiment runs. The best results are marked with an asterisk.

Hidden unit          Error [%]   Cross-Entropy
sigmoid              5.85        0.21
ReLU                 5.44        0.19
Noisy ReLU           5.79        0.21
Kumaraswamy(5,6)     5.18        0.17
Kumaraswamy(8,30)    4.87*       0.16*


Discussion According to the results (see Table 1) we can conclude that the
Kumaraswamy unit indeed tends to be preferable in comparison to the sigmoid
unit, ReLU and Noisy ReLU. The Kumaraswamy unit obtained the best results in
terms of both the test classification error and the test cross-entropy loss.
The comparison of the two considered settings of the shape parameters reveals
that Kumaraswamy(8,30) performs better than Kumaraswamy(5,6), i.e., bunches of
30 neurons with 8 elements each seem to be more suitable in training a neural
network.

It is a well-known fact that the application of ReLU can cause saturation of
some hidden units. However, we notice that the saturation of several hidden
units resulted in faster convergence of ReLU and Noisy ReLU in comparison to
other activation non-linearities (see Figure 2). For instance, ReLU obtained
the minimum after 18 epochs while the sigmoid neural network converged after
65 epochs. It is worth noticing that the Kumaraswamy unit obtained the best
result (better by about 3% in comparison to others) just after 10 epochs.

In Figure 3 the learned features (input-to-hidden weights) for the considered
types of hidden units are depicted. The features learned by the sigmoid neural
network represent different patterns, but some of them seem to be redundant,
e.g., there are many patterns which look like a diagonal stroke. In contrast,
ReLU and Noisy ReLU yield more diverse features. However, there are many
neurons which are saturated, although in the case of Noisy ReLU this effect is
less evident. Features of slightly different shape are obtained by the
Kumaraswamy units. These are more similar to the ones learned by ReLU;
nevertheless, they are neither saturated nor degenerate as in the case of the
sigmoid units.

Next, we investigate the impact of different non-linearities on the activity
of hidden units. The histograms of mean activity of neurons measured on the
test set, given in Figure 4, indicate that on average the ReLU, Noisy ReLU and
Kumaraswamy units lead to a larger number of neurons with activation smaller
than 0.01 in comparison to the sigmoid units, namely, ReLU: 150, Noisy ReLU:
19, Kumaraswamy(5,6): 7, Kumaraswamy(8,30): 8, sigmoid: 0. However, the
Kumaraswamy unit has two advantages over ReLU and Noisy ReLU. First, there are
fewer saturated neurons. Second, all Kumaraswamy units have mean activity less
than 0.5, while in the case of ReLU and Noisy ReLU there are some units with
very strong activation, i.e., above 0.5 or even larger than 1.
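
The counts above rely on the mean activity of each hidden unit over the test
set and on a threshold of 0.01. A minimal Python sketch of this bookkeeping is
given below; the function names and the toy activation values are assumptions
made for illustration and do not reproduce the reported numbers.

```python
def mean_activity(activations):
    """activations[n][m] is the output of hidden unit m on test example n."""
    n_examples = len(activations)
    n_units = len(activations[0])
    return [sum(row[m] for row in activations) / n_examples
            for m in range(n_units)]

def count_inactive(mean_acts, threshold=0.01):
    """Number of hidden units whose mean activity is below the threshold."""
    return sum(1 for value in mean_acts if value < threshold)

# Made-up activations for 3 test examples and 4 hidden units.
acts = [[0.0, 0.2, 0.9, 0.0],
        [0.0, 0.4, 1.3, 0.02],
        [0.0, 0.0, 0.5, 0.0]]
means = mean_activity(acts)
inactive = count_inactive(means)
```

In this toy example the first and the last unit fall below the threshold,
while the third unit has a mean activity above 0.5.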



Figure 2. Validation classification error against the number of epochs during
the learning process for the considered types of hidden units.


4. Conclusion 

In this work we introduced a new idea of improving neural networks with a
bunch of neurons, i.e., replicas of the same neuron, each of which consists of
independent elements. The bunch of neurons can be easily modeled by the
generalized Kumaraswamy distribution, which resulted in the formulation of a
new non-linear activation function that we refer to as the Kumaraswamy unit. A
nice property of the Kumaraswamy unit is that for properly chosen parameters
it can be approximately shaped like ReLU while returning values between 0 and
1. In the experiment the performance of the neural network with the
Kumaraswamy unit was compared with other activation functions, namely, the
sigmoid unit, ReLU and Noisy ReLU. The obtained results on MNIST seem to
confirm the superiority of the Kumaraswamy unit; nonetheless, this statement
needs to be confirmed with more thorough studies. We believe that the
performed experiment gives a good starting point for further research on
Kumaraswamy units applied to deep neural networks.
Figure 3. A visualization of the first 100 input-to-hidden weights (features).
It is apparent that the patterns learned by the neural network with the units
considered in the paper differ in shape. As expected, the application of ReLU
results in saturation of several hidden units. The same outcome, but less
evident, can be observed in the case of Noisy ReLU. The Kumaraswamy unit does
not lead to saturated or degenerate hidden units.
[Figure panels: (a) Sigmoid, (b) ReLU, (c) Noisy ReLU, (d) Kumaraswamy(5,6), (e) Kumaraswamy(8,30)]


Figure 4. Mean activity of hidden units calculated on the test set. The sigmoid neural network has all hidden units with non-zero activity 
and some neurons are very active (above 0.5). Application of ReLU results in many saturated hidden units while the Noisy ReLU 
alleviates this effect. Nonetheless, for ReLU and Noisy ReLU there are some hidden units with very strong response (i.e. above 1). The 
utilization of Kumaraswamy units leads to a representation in which all neurons have activity less than 0.5 on average. 